Introduction

Bike sharing systems generate strong daily and seasonal patterns that are closely tied to weather and commuting behaviour.

The goal of this project is to predict hourly bike rental demand (count) and to understand which factors matter most for usage.

Accurate forecasts can help operators plan fleet allocation, staff shifts, and maintenance more efficiently.

I work with the Kaggle “Bike Sharing Demand” dataset (2011–2012), focusing on hourly demand, weather, and calendar effects.

Data

The raw training data contain an hourly timestamp (datetime); categorical indicators for season, holiday, workingday, and weather; continuous weather variables temp, atemp, humidity, and windspeed; and three demand variables: casual, registered, and their sum count, which is the target of interest.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(visdat)
library(skimr)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## 
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## 
## The following object is masked from 'package:datasets':
## 
##     sleep
library(ggplot2)
library(dplyr)
library(recipes)
## 
## Attaching package: 'recipes'
## 
## The following object is masked from 'package:VIM':
## 
##     prepare
## 
## The following object is masked from 'package:stringr':
## 
##     fixed
## 
## The following object is masked from 'package:stats':
## 
##     step
library(readr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.3.0
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.8     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.1
## ✔ parsnip      1.3.2     ✔ yardstick    1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()  masks purrr::discard()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ recipes::fixed()   masks stringr::fixed()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ recipes::prepare() masks VIM::prepare()
## ✖ car::recode()      masks dplyr::recode()
## ✖ car::some()        masks purrr::some()
## ✖ yardstick::spec()  masks readr::spec()
## ✖ recipes::step()    masks stats::step()
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(Metrics)
## 
## Attaching package: 'Metrics'
## 
## The following objects are masked from 'package:yardstick':
## 
##     accuracy, mae, mape, mase, precision, recall, rmse, smape
library(patchwork)
train <- read_csv('/Users/sam/bikesharingdemand/data/train.csv')
## Rows: 10886 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (11): season, holiday, workingday, weather, temp, atemp, humidity, wind...
## dttm  (1): datetime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The variables season, holiday, workingday, and weather are integer-coded categorical variables, so they will be converted to factors.

str(train)
## spc_tbl_ [10,886 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ datetime  : POSIXct[1:10886], format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
##  $ season    : num [1:10886] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : num [1:10886] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num [1:10886] 0 0 0 0 0 ...
##  $ casual    : num [1:10886] 3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num [1:10886] 13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   datetime = col_datetime(format = ""),
##   ..   season = col_double(),
##   ..   holiday = col_double(),
##   ..   workingday = col_double(),
##   ..   weather = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   humidity = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   count = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(train)
##     datetime                       season         holiday       
##  Min.   :2011-01-01 00:00:00   Min.   :1.000   Min.   :0.00000  
##  1st Qu.:2011-07-02 07:15:00   1st Qu.:2.000   1st Qu.:0.00000  
##  Median :2012-01-01 20:30:00   Median :3.000   Median :0.00000  
##  Mean   :2011-12-27 05:56:22   Mean   :2.507   Mean   :0.02857  
##  3rd Qu.:2012-07-01 12:45:00   3rd Qu.:4.000   3rd Qu.:0.00000  
##  Max.   :2012-12-19 23:00:00   Max.   :4.000   Max.   :1.00000  
##    workingday        weather           temp           atemp      
##  Min.   :0.0000   Min.   :1.000   Min.   : 0.82   Min.   : 0.76  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:13.94   1st Qu.:16.66  
##  Median :1.0000   Median :1.000   Median :20.50   Median :24.24  
##  Mean   :0.6809   Mean   :1.418   Mean   :20.23   Mean   :23.66  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:26.24   3rd Qu.:31.06  
##  Max.   :1.0000   Max.   :4.000   Max.   :41.00   Max.   :45.45  
##     humidity        windspeed          casual         registered   
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 47.00   1st Qu.: 7.002   1st Qu.:  4.00   1st Qu.: 36.0  
##  Median : 62.00   Median :12.998   Median : 17.00   Median :118.0  
##  Mean   : 61.89   Mean   :12.799   Mean   : 36.02   Mean   :155.6  
##  3rd Qu.: 77.00   3rd Qu.:16.998   3rd Qu.: 49.00   3rd Qu.:222.0  
##  Max.   :100.00   Max.   :56.997   Max.   :367.00   Max.   :886.0  
##      count      
##  Min.   :  1.0  
##  1st Qu.: 42.0  
##  Median :145.0  
##  Mean   :191.6  
##  3rd Qu.:284.0  
##  Max.   :977.0
skim(train)
Data summary
Name train
Number of rows 10886
Number of columns 12
_______________________
Column type frequency:
numeric 11
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season 0 1 2.51 1.12 1.00 2.00 3.00 4.00 4.00 ▇▇▁▇▇
holiday 0 1 0.03 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
workingday 0 1 0.68 0.47 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
weather 0 1 1.42 0.63 1.00 1.00 1.00 2.00 4.00 ▇▃▁▁▁
temp 0 1 20.23 7.79 0.82 13.94 20.50 26.24 41.00 ▂▇▇▇▁
atemp 0 1 23.66 8.47 0.76 16.66 24.24 31.06 45.45 ▁▆▇▇▁
humidity 0 1 61.89 19.25 0.00 47.00 62.00 77.00 100.00 ▁▃▇▇▅
windspeed 0 1 12.80 8.16 0.00 7.00 13.00 17.00 57.00 ▇▆▂▁▁
casual 0 1 36.02 49.96 0.00 4.00 17.00 49.00 367.00 ▇▁▁▁▁
registered 0 1 155.55 151.04 0.00 36.00 118.00 222.00 886.00 ▇▃▁▁▁
count 0 1 191.57 181.14 1.00 42.00 145.00 284.00 977.00 ▇▃▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
datetime 0 1 2011-01-01 2012-12-19 23:00:00 2012-01-01 20:30:00 10886
vis_miss(train)

train <- train %>%
  mutate(
    year     = year(datetime),
    month    = month(datetime),
    day      = day(datetime),
    hour     = hour(datetime),
    wday_num = wday(datetime),
    wday     = wday(datetime, label = TRUE),
    log_count = log(count + 1)
  )
train <- train %>%
  mutate(
    season      = factor(season,
                         levels = c(1,2,3,4),
                         labels = c("Winter","Spring","Summer","Fall")),
    holiday     = factor(holiday, levels = c(0,1), labels=c("No","Yes")),
    workingday  = factor(workingday, levels = c(0,1), labels=c("No","Yes")),
    weather     = factor(weather, 
                         levels = c(1,2,3,4),
                         labels = c("Clear","Mist","Light Snow/Rain","Heavy Rain/Snow")),
        year   = factor(year),
    month  = factor(month, levels = 1:12, 
                    labels = c("Jan","Feb","Mar","Apr","May","Jun",
                               "Jul","Aug","Sep","Oct","Nov","Dec")),
    day    = factor(day, levels = 1:31),
    hour   = factor(hour, levels = 0:23),
    wday_num = factor(wday_num, levels = 1:7,
                      labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")),
    wday = factor(wday, 
                  levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
  )
str(train)
## tibble [10,886 × 19] (S3: tbl_df/tbl/data.frame)
##  $ datetime  : POSIXct[1:10886], format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
##  $ season    : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ workingday: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weather   : Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num [1:10886] 0 0 0 0 0 ...
##  $ casual    : num [1:10886] 3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num [1:10886] 13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
##  $ year      : Factor w/ 2 levels "2011","2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month     : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hour      : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ wday_num  : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ wday      : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ log_count : num [1:10886] 2.833 3.714 3.497 2.639 0.693 ...
p1 <- ggplot(train, aes(count)) +
  geom_histogram(bins = 40, fill="grey") +
  labs(title="Distribution of count")

p2 <- ggplot(train, aes(log_count)) +
  geom_histogram(bins = 40, fill="grey") +
  labs(title="Distribution of log(count+1)")

gridExtra::grid.arrange(p1, p2, ncol=2)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
bc <- boxcox(count ~ 1, data = train)

lambda_opt <- bc$x[which.max(bc$y)]
lambda_opt
## [1] 0.3030303
lambda <- 0.3

train$bc_count <- ((train$count + 1)^lambda - 1) / lambda

ggplot(train, aes(bc_count)) +
  geom_histogram(bins = 40, fill="grey") +
  labs(title = "Box-Cox transformed count (lambda = 0.3)")

I began by converting the categorical variables to factors and decomposing the original timestamp into year, month, day, and hour, in order to capture temporal demand patterns at different time scales. Because the response variable count is severely right-skewed, I initially applied a log(count + 1) transformation. To further stabilise the variance and approximate normality, I then applied a Box-Cox transformation; the optimal lambda was approximately 0.3. After this transformation, the target distribution is substantially closer to normal.
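Because models will be fit on the Box-Cox scale, predictions must eventually be mapped back to the original count scale with the inverse transform. A minimal sketch (inv_boxcox is a helper name introduced here; it assumes the same lambda = 0.3 and the +1 shift used above):

```r
# Inverse of the shifted Box-Cox transform: y = ((count + 1)^lambda - 1) / lambda
inv_boxcox <- function(y, lambda = 0.3) {
  (y * lambda + 1)^(1 / lambda) - 1
}

# Round-trip check on a few representative counts
x <- c(1, 16, 191, 977)
y <- ((x + 1)^0.3 - 1) / 0.3
inv_boxcox(y)  # recovers 1, 16, 191, 977 up to floating-point error
```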

EDA plots

# hourly demand
train %>%
  group_by(hour) %>%
  summarise(mean_count = mean(count)) %>%
  ggplot(aes(x = as.numeric(as.character(hour)), y = mean_count)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Average count by hour of day",
       x = "hour")

Hourly demand exhibits a strong commuting pattern. The demand increases sharply between 6 and 8 AM, peaks during this period, and declines around 9 AM. The demand rises again from approximately 4 PM, reaching another peak at 5–6 PM before dropping rapidly after 7 PM. Although nighttime demand is relatively low, it is not negligible. Overall, the dominant demand is concentrated around commuting hours, with a noticeable secondary increase around midday.

# working day and hour
train %>%
  group_by(workingday, hour) %>%
  summarise(mean_count = mean(count), .groups="drop") %>%
  ggplot(aes(x = as.numeric(as.character(hour)), y = mean_count, colour=workingday)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks=0:23) +
  labs(title="Hourly demand pattern by workingday")

When I separated working days from non-working days, distinct behavioral patterns emerged. On working days, demand is heavily concentrated around morning and evening commuting peaks. On weekends, demand increases steadily from the morning, remains high through the afternoon, and gradually declines in the late afternoon and evening.

# week day and hour
train %>%
  group_by(wday, hour) %>%
  summarise(mean_count = mean(count), .groups="drop") %>%
  ggplot(aes(x = as.numeric(as.character(hour)), y = mean_count, colour=wday)) +
  geom_line() +
  scale_x_continuous(breaks=0:23) +
  labs(title="Hourly demand by day of week")

Day-of-week effects beyond the weekday–weekend distinction are very weak. Monday through Friday follow nearly identical patterns, as do Saturday and Sunday. This suggests that a binary working-day indicator is sufficient, and detailed weekday dummies add little explanatory value.

# Seasons and count
ggplot(train, aes(season, count, fill=season)) +
  geom_boxplot() +
  labs(title="Season vs Count")

Seasonal effects are also pronounced. Demand is lowest during winter and highest during summer, reflecting temperature-driven cycling behaviour. Because season and month share overlapping information, including both simultaneously in the model would introduce redundancy and multicollinearity. Therefore, one of them can be safely removed during modelling.
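The season–month overlap can be checked directly with a cross-tabulation; in this dataset each month falls within a single season, so a block pattern in the table confirms the redundancy. A quick sketch:

```r
# Hours per season-month cell; non-zero entries cluster in blocks of three months
with(train, table(season, month))
```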

# holiday and working day
p_hol <- ggplot(train, aes(holiday, count, fill=holiday)) +
  geom_boxplot() + labs(title="Holiday vs count")

p_work <- ggplot(train, aes(workingday, count, fill=workingday)) +
  geom_boxplot() + labs(title="Workingday vs count")

gridExtra::grid.arrange(p_hol, p_work, ncol=2)

Holiday and working-day effects show an instructive contrast. Demand tends to be somewhat higher outside holidays, and working days show higher demand than non-working days, suggesting that commuting-related usage dominates. While median demand levels do not differ dramatically, the interquartile ranges and upper quartiles separate clearly, indicating that high-demand hours are more frequent on working days.

# Weather impact
ggplot(train, aes(weather, count, fill=weather)) +
  geom_boxplot() +
  labs(title="Weather vs count")

As expected, worsening weather conditions lead to a clear reduction in bike demand. Heavier rain or snow is associated with markedly lower demand.

p1 <- ggplot(train, aes(casual)) +
  geom_histogram(bins = 40, fill="skyblue") +
  labs(title="Distribution of casual users")

p2 <- ggplot(train, aes(registered)) +
  geom_histogram(bins = 40, fill="orange") +
  labs(title="Distribution of registered users")

p1+p2

train %>%
  group_by(hour) %>%
  summarise(
    casual_mean = mean(casual),
    registered_mean = mean(registered)
  ) %>%
  pivot_longer(cols = c(casual_mean, registered_mean),
               names_to = "type",
               values_to = "mean_val") %>%
  ggplot(aes(x = as.numeric(as.character(hour)), y = mean_val, color = type)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Hourly usage pattern: casual vs registered",
       x = "hour", y = "mean count")

train %>%
  group_by(workingday) %>%
  summarise(
    casual_mean = mean(casual),
    registered_mean = mean(registered)
  ) %>%
  pivot_longer(cols = c(casual_mean, registered_mean),
               names_to="type",
               values_to="mean_val") %>%
  ggplot(aes(x=workingday, y=mean_val, fill=type)) +
  geom_col(position="dodge") +
  labs(title="Casual vs Registered by Workingday")

train %>%
  group_by(workingday, hour) %>%
  summarise(
    casual_mean = mean(casual),
    registered_mean = mean(registered),
    .groups="drop"
  ) %>%
  pivot_longer(cols = c(casual_mean, registered_mean),
               names_to = "type",
               values_to = "mean_val") %>%
  ggplot(aes(
    x = as.numeric(as.character(hour)),
    y = mean_val,
    color = type
  )) +
  geom_line(linewidth = 1) +
  facet_wrap(~ workingday, ncol = 1,
             labeller = labeller(workingday = c("No" = "Weekend", "Yes" = "Weekday"))) +
  scale_x_continuous(breaks = 0:23) +
  labs(
    title = "Hourly pattern for casual vs registered on weekend vs weekday",
    x = "Hour",
    y = "Mean count"
  )

The majority of bike usage comes from registered users, whose behaviour largely determines the overall demand structure. Registered users primarily ride for commuting, which explains the sharp weekday peaks; even on weekends they account for more trips than casual users, suggesting that regular commuters also ride recreationally. Casual users, in contrast, are almost absent on weekdays and appear mainly on weekends. I verified that count is exactly the sum of casual and registered, so including either as a predictor would let a model reconstruct the target directly rather than learn meaningful relationships. Both variables were therefore excluded from all models.
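The additivity of the target is easy to verify directly:

```r
# count should equal casual + registered for every hour
all(train$count == train$casual + train$registered)  # TRUE
```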

#numerical variables 
p_temp <- ggplot(train, aes(temp)) + geom_histogram(bins=40)
p_atemp <- ggplot(train, aes(atemp)) + geom_histogram(bins=40)
p_hum   <- ggplot(train, aes(humidity)) + geom_histogram(bins=40)
p_wind  <- ggplot(train, aes(windspeed)) + geom_histogram(bins=40)

gridExtra::grid.arrange(p_temp, p_atemp, p_hum, p_wind, ncol=2)

# corr matrix
train %>%
  dplyr::select(temp, atemp, humidity, windspeed, count) %>%
  GGally::ggcorr(label = TRUE)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Temperature (temp) represents actual air temperature, while atemp reflects perceived feels-like temperature. Both variables show clear seasonality and several discrete peaks, likely due to rounded measurements. Their distributions are approximately normal and align well with real-world cycling behaviour. Demand is highest at moderate temperatures and decreases at extreme cold or heat. Because cycling exposes users directly to outdoor conditions, temperature is one of the strongest determinants of demand. As expected, temp and atemp are almost perfectly correlated. So, either can be used.
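The near-perfect relationship between the two temperature variables can be quantified in one line:

```r
# Pearson correlation between actual and feels-like temperature
cor(train$temp, train$atemp)  # very close to 1
```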

Humidity displays a right-skewed distribution, and higher humidity generally suppresses demand, likely due to discomfort and its association with poor weather.

Windspeed shows an unusually large number of zero values. In reality, sustained zero windspeed is extremely rare, and the fact that zero is the modal value strongly suggests that unmeasured or missing observations were recorded as zero.
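The extent of the problem is easy to quantify before imputing:

```r
# Number and share of hours with windspeed recorded as exactly zero
sum(train$windspeed == 0)
mean(train$windspeed == 0)
```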

For this reason, I treat windspeed values equal to zero as missing and impute them using KNN imputation.

KNN imputation for windspeed value = 0

train_imp <- train %>%
  mutate(windspeed = ifelse(windspeed == 0, NA, windspeed))

imp_data <- train_imp %>%
  mutate(
    season_num   = as.numeric(season),
    weather_num  = as.numeric(weather),
    year_num     = as.numeric(year),
    month_num    = as.numeric(month),
    hour_num     = as.numeric(hour)
  ) %>%
  dplyr::select(
    windspeed, temp, atemp, humidity,
    season_num, weather_num, year_num, month_num, hour_num
  )

imp_result <- kNN(
  imp_data,
  variable = "windspeed",
  k = 5,
  imp_var = FALSE
)
##        temp       atemp    humidity  season_num weather_num    year_num 
##       0.820       0.760       0.000       1.000       1.000       1.000 
##   month_num    hour_num        temp       atemp    humidity  season_num 
##       1.000       1.000      41.000      45.455     100.000       4.000 
## weather_num    year_num   month_num    hour_num 
##       4.000       2.000      12.000      24.000
train_imp$windspeed <- imp_result$windspeed
p_wind  <- ggplot(train, aes(windspeed)) + geom_histogram(bins=500, fill="red")+
  labs(title="Before KNN Imputation (0 → NA)")

p_new <- ggplot(train_imp, aes(windspeed)) +
  geom_histogram(bins = 500, fill="blue") +
  labs(title="After KNN Imputation")

gridExtra::grid.arrange(p_wind, p_new, ncol=2)

After imputation, the artificial spike at zero disappears, producing a more realistic windspeed distribution and preventing distortion in downstream models.

This correction would prevent distorted model behaviour, especially in linear and Poisson regression where extreme zeros can bias coefficient estimates, and in Random Forest models where such spikes degrade split purity.

year_summary <- train_imp %>%
  group_by(year) %>%
  summarise(
    mean_count = mean(count),
    sd_count   = sd(count),
    n          = n(),
    se_count   = sd_count / sqrt(n)
  )

p1<- ggplot(year_summary, aes(x = factor(year), y = mean_count)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_count - se_count,
                    ymax = mean_count + se_count),
                width = 0.2) +
  labs(title = "Average count by year", x = "year", y = "count")
month_summary <- train_imp %>%
  group_by(month) %>%
  summarise(
    mean_count = mean(count),
    sd_count   = sd(count),
    n          = n(),
    se_count   = sd_count / sqrt(n)
  )

p2<- ggplot(month_summary, aes(x = factor(month), y = mean_count)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_count - se_count,
                    ymax = mean_count + se_count),
                width = 0.2) +
  labs(title = "Average count by month", x = "month", y = "count")
year_month_summary <- train_imp %>%
  mutate(
    year_num  = as.integer(as.character(year)),   
    month_num = as.integer(month), 
    year_month = sprintf("%d-%02d", year_num, month_num)  
  ) %>%
  group_by(year_month) %>%
  summarise(
    mean_count = mean(count),
    sd_count   = sd(count),
    n          = n(),
    se_count   = sd_count / sqrt(n),
    .groups = "drop"
  )

p3<- ggplot(
  year_month_summary,
  aes(
    x = factor(year_month, levels = sort(unique(year_month))),
    y = mean_count
  )
) +
  geom_col() +
  geom_errorbar(
    aes(
      ymin = mean_count - se_count,
      ymax = mean_count + se_count
    ),
    width = 0.2
  ) +
  labs(
    title = "Average count by year-month",
    x = "year_month",
    y = "count"
  ) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
(p1+p2)/p3

Finally, examining demand by year and month revealed a clear increase in usage from 2011 to 2012, indicating system growth. Monthly patterns show strong seasonality.

day_summary <- train_imp %>%
  group_by(day) %>%
  summarise(
    mean_count = mean(count),
    sd_count   = sd(count),
    n          = n(),
    se_count   = sd_count / sqrt(n),
    .groups = "drop"
  )

p4 <- ggplot(day_summary, aes(x = day, y = mean_count)) +
  geom_col(fill = "grey40") +
  geom_errorbar(
    aes(ymin = mean_count - se_count,
        ymax = mean_count + se_count),
    width = 0.2
  ) +
  labs(
    title = "Average count by day of month (1–31)",
    x = "day",
    y = "count"
  ) +
  theme_bw()

p4

The day-of-month does not effectively explain demand. This variable was excluded from modelling.

Modelling

set.seed(123)


split <- initial_split(train_imp, prop = 0.75)
train_set <- training(split)
test_set  <- testing(split)

train_set <- train_set %>%
  mutate(
    hour_bin = case_when(
      hour %in% c(0,1,2,3,4,5)   ~ "late_night",
      hour %in% c(6,7,8,9)       ~ "morning_peak",
      hour %in% c(10,11,12,13)   ~ "midday",
      hour %in% c(14,15,16)      ~ "afternoon",
      hour %in% c(17,18,19)      ~ "evening_peak",
      TRUE                       ~ "night"
    ),
    hour_bin = factor(hour_bin,
        levels = c("late_night","morning_peak","midday",
                   "afternoon","evening_peak","night"))
  )

test_set <- test_set %>%
  mutate(
    hour_bin = case_when(
      hour %in% c(0,1,2,3,4,5)   ~ "late_night",
      hour %in% c(6,7,8,9)       ~ "morning_peak",
      hour %in% c(10,11,12,13)   ~ "midday",
      hour %in% c(14,15,16)      ~ "afternoon",
      hour %in% c(17,18,19)      ~ "evening_peak",
      TRUE                       ~ "night"
    ),
    hour_bin = factor(hour_bin,
        levels = c("late_night","morning_peak","midday",
                   "afternoon","evening_peak","night"))
  )
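Because the same case_when block is applied to both splits, the cutoffs can drift apart if they are later edited in one place but not the other. The binning could equivalently be expressed as a single helper (add_hour_bin is a name introduced here; cut() uses the same boundaries as the case_when above):

```r
add_hour_bin <- function(df) {
  h <- as.numeric(as.character(df$hour))
  # (-1,5] -> late_night, (5,9] -> morning_peak, (9,13] -> midday,
  # (13,16] -> afternoon, (16,19] -> evening_peak, (19,23] -> night
  df$hour_bin <- cut(h,
    breaks = c(-1, 5, 9, 13, 16, 19, 23),
    labels = c("late_night", "morning_peak", "midday",
               "afternoon", "evening_peak", "night"))
  df
}

# Equivalent to the two mutate() calls above:
# train_set <- add_hour_bin(train_set)
# test_set  <- add_hour_bin(test_set)
```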
# Baseline Linear Regression with lambda 0.3
bike_rec <- recipe(bc_count ~ ., data = train_set) %>%
  step_rm(datetime, casual, registered, count, log_count, season,
          day, hour, wday, wday_num) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%       
  step_normalize(all_predictors())
bike_prep   <- prep(bike_rec)

train_baked <- bake(bike_prep, new_data = NULL)      
test_baked  <- bake(bike_prep, new_data = test_set)  
lm_bc <- lm(bc_count ~ ., data = train_baked)
summary(lm_bc)
## 
## Call:
## lm(formula = bc_count ~ ., data = train_baked)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9375  -1.6874  -0.1061   1.7547   9.0785 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             10.95686    0.02913 376.144  < 2e-16 ***
## temp                     0.81772    0.19603   4.171 3.06e-05 ***
## atemp                    0.53762    0.17746   3.030 0.002457 ** 
## humidity                -0.55380    0.04001 -13.841  < 2e-16 ***
## windspeed               -0.11122    0.03280  -3.391 0.000699 ***
## holiday_Yes             -0.05062    0.03052  -1.658 0.097256 .  
## workingday_Yes          -0.04898    0.03015  -1.625 0.104253    
## weather_Mist            -0.05955    0.03159  -1.885 0.059485 .  
## weather_Light.Snow.Rain -0.48808    0.03260 -14.970  < 2e-16 ***
## year_X2012               0.93090    0.02967  31.374  < 2e-16 ***
## month_Feb                0.13091    0.04033   3.246 0.001175 ** 
## month_Mar                0.16197    0.04293   3.773 0.000162 ***
## month_Apr                0.32553    0.04531   7.185 7.30e-13 ***
## month_May                0.53893    0.05195  10.375  < 2e-16 ***
## month_Jun                0.36724    0.05923   6.200 5.92e-10 ***
## month_Jul                0.14072    0.06519   2.159 0.030898 *  
## month_Aug                0.24098    0.06401   3.765 0.000168 ***
## month_Sep                0.46658    0.05718   8.160 3.85e-16 ***
## month_Oct                0.66251    0.04904  13.510  < 2e-16 ***
## month_Nov                0.65037    0.04202  15.479  < 2e-16 ***
## month_Dec                0.65642    0.04187  15.676  < 2e-16 ***
## hour_bin_morning_peak    2.83957    0.03453  82.237  < 2e-16 ***
## hour_bin_midday          2.79471    0.03808  73.392  < 2e-16 ***
## hour_bin_afternoon       2.63601    0.03912  67.380  < 2e-16 ***
## hour_bin_evening_peak    3.51585    0.03724  94.401  < 2e-16 ***
## hour_bin_night           2.31663    0.03541  65.424  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.632 on 8138 degrees of freedom
## Multiple R-squared:  0.7389, Adjusted R-squared:  0.7381 
## F-statistic: 921.1 on 25 and 8138 DF,  p-value: < 2.2e-16
lm_step <- stats::step(lm_bc, direction = "both")
## Start:  AIC=15827.17
## bc_count ~ temp + atemp + humidity + windspeed + holiday_Yes + 
##     workingday_Yes + weather_Mist + weather_Light.Snow.Rain + 
##     year_X2012 + month_Feb + month_Mar + month_Apr + month_May + 
##     month_Jun + month_Jul + month_Aug + month_Sep + month_Oct + 
##     month_Nov + month_Dec + hour_bin_morning_peak + hour_bin_midday + 
##     hour_bin_afternoon + hour_bin_evening_peak + hour_bin_night
## 
##                           Df Sum of Sq    RSS   AIC
## <none>                                  56375 15827
## - workingday_Yes           1        18  56393 15828
## - holiday_Yes              1        19  56394 15828
## - weather_Mist             1        25  56399 15829
## - month_Jul                1        32  56407 15830
## - atemp                    1        64  56438 15834
## - month_Feb                1        73  56448 15836
## - windspeed                1        80  56454 15837
## - month_Aug                1        98  56473 15839
## - month_Mar                1        99  56473 15839
## - temp                     1       121  56495 15843
## - month_Jun                1       266  56641 15864
## - month_Apr                1       358  56732 15877
## - month_Sep                1       461  56836 15892
## - month_May                1       746  57120 15932
## - month_Oct                1      1264  57639 16006
## - humidity                 1      1327  57702 16015
## - weather_Light.Snow.Rain  1      1552  57927 16047
## - month_Nov                1      1660  58034 16062
## - month_Dec                1      1702  58077 16068
## - year_X2012               1      6819  63194 16757
## - hour_bin_night           1     29651  86025 19276
## - hour_bin_afternoon       1     31450  87825 19444
## - hour_bin_midday          1     37313  93687 19972
## - hour_bin_morning_peak    1     46849 103223 20763
## - hour_bin_evening_peak    1     61734 118108 21863
summary(lm_step)
## 
## Call:
## lm(formula = bc_count ~ temp + atemp + humidity + windspeed + 
##     holiday_Yes + workingday_Yes + weather_Mist + weather_Light.Snow.Rain + 
##     year_X2012 + month_Feb + month_Mar + month_Apr + month_May + 
##     month_Jun + month_Jul + month_Aug + month_Sep + month_Oct + 
##     month_Nov + month_Dec + hour_bin_morning_peak + hour_bin_midday + 
##     hour_bin_afternoon + hour_bin_evening_peak + hour_bin_night, 
##     data = train_baked)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9375  -1.6874  -0.1061   1.7547   9.0785 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             10.95686    0.02913 376.144  < 2e-16 ***
## temp                     0.81772    0.19603   4.171 3.06e-05 ***
## atemp                    0.53762    0.17746   3.030 0.002457 ** 
## humidity                -0.55380    0.04001 -13.841  < 2e-16 ***
## windspeed               -0.11122    0.03280  -3.391 0.000699 ***
## holiday_Yes             -0.05062    0.03052  -1.658 0.097256 .  
## workingday_Yes          -0.04898    0.03015  -1.625 0.104253    
## weather_Mist            -0.05955    0.03159  -1.885 0.059485 .  
## weather_Light.Snow.Rain -0.48808    0.03260 -14.970  < 2e-16 ***
## year_X2012               0.93090    0.02967  31.374  < 2e-16 ***
## month_Feb                0.13091    0.04033   3.246 0.001175 ** 
## month_Mar                0.16197    0.04293   3.773 0.000162 ***
## month_Apr                0.32553    0.04531   7.185 7.30e-13 ***
## month_May                0.53893    0.05195  10.375  < 2e-16 ***
## month_Jun                0.36724    0.05923   6.200 5.92e-10 ***
## month_Jul                0.14072    0.06519   2.159 0.030898 *  
## month_Aug                0.24098    0.06401   3.765 0.000168 ***
## month_Sep                0.46658    0.05718   8.160 3.85e-16 ***
## month_Oct                0.66251    0.04904  13.510  < 2e-16 ***
## month_Nov                0.65037    0.04202  15.479  < 2e-16 ***
## month_Dec                0.65642    0.04187  15.676  < 2e-16 ***
## hour_bin_morning_peak    2.83957    0.03453  82.237  < 2e-16 ***
## hour_bin_midday          2.79471    0.03808  73.392  < 2e-16 ***
## hour_bin_afternoon       2.63601    0.03912  67.380  < 2e-16 ***
## hour_bin_evening_peak    3.51585    0.03724  94.401  < 2e-16 ***
## hour_bin_night           2.31663    0.03541  65.424  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.632 on 8138 degrees of freedom
## Multiple R-squared:  0.7389, Adjusted R-squared:  0.7381 
## F-statistic: 921.1 on 25 and 8138 DF,  p-value: < 2.2e-16
plot(lm_step)

vif(lm_step)
##                    temp                   atemp                humidity 
##               45.283717               37.108725                1.886645 
##               windspeed             holiday_Yes          workingday_Yes 
##                1.267428                1.097731                1.070974 
##            weather_Mist weather_Light.Snow.Rain              year_X2012 
##                1.176206                1.252593                1.037390 
##               month_Feb               month_Mar               month_Apr 
##                1.916405                2.171410                2.418693 
##               month_May               month_Jun               month_Jul 
##                3.179689                4.134218                5.007400 
##               month_Aug               month_Sep               month_Oct 
##                4.828409                3.852621                2.833805 
##               month_Nov               month_Dec   hour_bin_morning_peak 
##                2.080328                2.066147                1.404942 
##         hour_bin_midday      hour_bin_afternoon   hour_bin_evening_peak 
##                1.708697                1.803512                1.634517 
##          hour_bin_night 
##                1.477497
lm_no_atemp <- lm(
  bc_count ~ temp + humidity + windspeed + 
    holiday_Yes + workingday_Yes + weather_Mist + 
    year_X2012 + month_Feb + month_Mar + month_Apr + month_May + 
    month_Jun + month_Jul + month_Aug + month_Sep + month_Oct + 
    month_Nov + month_Dec + hour_bin_morning_peak + hour_bin_midday + 
    hour_bin_afternoon + hour_bin_evening_peak + hour_bin_night, 
    data = train_baked
)


lm_step_no_atemp <- stats::step(lm_no_atemp, direction = "both")
## Start:  AIC=16060.1
## bc_count ~ temp + humidity + windspeed + holiday_Yes + workingday_Yes + 
##     weather_Mist + year_X2012 + month_Feb + month_Mar + month_Apr + 
##     month_May + month_Jun + month_Jul + month_Aug + month_Sep + 
##     month_Oct + month_Nov + month_Dec + hour_bin_morning_peak + 
##     hour_bin_midday + hour_bin_afternoon + hour_bin_evening_peak + 
##     hour_bin_night
## 
##                         Df Sum of Sq    RSS   AIC
## <none>                                58035 16060
## - holiday_Yes            1        17  58052 16060
## - month_Jul              1        25  58060 16062
## - workingday_Yes         1        38  58073 16064
## - weather_Mist           1        39  58074 16064
## - month_Feb              1        57  58091 16066
## - month_Aug              1        83  58117 16070
## - month_Mar              1        95  58130 16072
## - month_Jun              1       248  58282 16093
## - windspeed              1       254  58288 16094
## - month_Apr              1       328  58363 16104
## - month_Sep              1       458  58492 16122
## - month_May              1       747  58782 16163
## - month_Oct              1      1269  59304 16235
## - month_Nov              1      1734  59769 16298
## - month_Dec              1      1856  59891 16315
## - temp                   1      2875  60909 16453
## - humidity               1      3213  61248 16498
## - year_X2012             1      6684  64719 16948
## - hour_bin_night         1     28866  86900 19354
## - hour_bin_afternoon     1     30046  88081 19464
## - hour_bin_midday        1     36051  94086 20003
## - hour_bin_morning_peak  1     46446 104480 20858
## - hour_bin_evening_peak  1     60147 118182 21864
summary(lm_step_no_atemp)
## 
## Call:
## lm(formula = bc_count ~ temp + humidity + windspeed + holiday_Yes + 
##     workingday_Yes + weather_Mist + year_X2012 + month_Feb + 
##     month_Mar + month_Apr + month_May + month_Jun + month_Jul + 
##     month_Aug + month_Sep + month_Oct + month_Nov + month_Dec + 
##     hour_bin_morning_peak + hour_bin_midday + hour_bin_afternoon + 
##     hour_bin_evening_peak + hour_bin_night, data = train_baked)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6796  -1.6539  -0.0954   1.7717   8.8316 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           10.95686    0.02955 370.771  < 2e-16 ***
## temp                   1.39497    0.06947  20.080  < 2e-16 ***
## humidity              -0.78758    0.03710 -21.229  < 2e-16 ***
## windspeed             -0.19347    0.03243  -5.967 2.53e-09 ***
## holiday_Yes           -0.04789    0.03092  -1.549 0.121423    
## workingday_Yes        -0.07060    0.03053  -2.312 0.020780 *  
## weather_Mist           0.07218    0.03075   2.347 0.018949 *  
## year_X2012             0.92120    0.03009  30.618  < 2e-16 ***
## month_Feb              0.11517    0.04088   2.817 0.004855 ** 
## month_Mar              0.15895    0.04354   3.650 0.000263 ***
## month_Apr              0.31144    0.04592   6.782 1.27e-11 ***
## month_May              0.53951    0.05270  10.238  < 2e-16 ***
## month_Jun              0.35339    0.05996   5.894 3.93e-09 ***
## month_Jul              0.12429    0.06586   1.887 0.059184 .  
## month_Aug              0.21898    0.06427   3.407 0.000660 ***
## month_Sep              0.46358    0.05787   8.011 1.29e-15 ***
## month_Oct              0.66377    0.04975  13.342  < 2e-16 ***
## month_Nov              0.66465    0.04261  15.597  < 2e-16 ***
## month_Dec              0.68440    0.04241  16.136  < 2e-16 ***
## hour_bin_morning_peak  2.82614    0.03501  80.713  < 2e-16 ***
## hour_bin_midday        2.72884    0.03838  71.110  < 2e-16 ***
## hour_bin_afternoon     2.54924    0.03927  64.918  < 2e-16 ***
## hour_bin_evening_peak  3.43858    0.03744  91.849  < 2e-16 ***
## hour_bin_night         2.28008    0.03583  63.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.67 on 8140 degrees of freedom
## Multiple R-squared:  0.7312, Adjusted R-squared:  0.7304 
## F-statistic: 962.7 on 23 and 8140 DF,  p-value: < 2.2e-16
plot(lm_step_no_atemp)

inv_boxcox <- function(y_bc, lambda) {
  ((y_bc * lambda) + 1)^(1 / lambda) - 1
}

pred_bc  <- predict(lm_step_no_atemp, newdata = test_baked)
pred_cnt <- inv_boxcox(pred_bc, lambda)

rmsle <- function(pred, actual) {
  sqrt(mean((log(pred + 1) - log(actual + 1))^2))
}

rmsle_lm_bc <- rmsle(pred_cnt, test_set$count)
rmsle_lm_bc
## [1] 0.728459

The first model fitted was a linear regression on the Box–Cox-transformed response (bc_count). To reduce multicollinearity and avoid the overfitting that two dozen hourly dummy variables would invite, I grouped the 24 hourly levels into six broader time bins representing distinct behavioural periods.

The baseline linear model achieved an adjusted R^2 of approximately 0.73. Variance inflation factors flagged severe collinearity between temp and atemp (VIF above 37 for both), so atemp was dropped and the model refit with little loss of fit. However, diagnostic plots revealed clear heteroscedasticity: residual variance grew with the fitted values, violating the constant-variance assumption. Evaluated with RMSLE, the metric used in this Kaggle competition, the linear model scored 0.728.
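As a sanity check on the back-transformation, the inverse Box–Cox function defined above should exactly undo the forward transform. A minimal base-R illustration with toy counts and a made-up lambda (0.3 is purely illustrative, not the value estimated from the data):

```r
# Forward Box-Cox on (y + 1), matching the inverse used in this report.
boxcox_fwd <- function(y, lambda) ((y + 1)^lambda - 1) / lambda
inv_boxcox <- function(y_bc, lambda) ((y_bc * lambda) + 1)^(1 / lambda) - 1

y <- c(1, 16, 145, 977)  # demand-like counts
all.equal(inv_boxcox(boxcox_fwd(y, 0.3), 0.3), y)  # TRUE: round-trip recovers y
```

If the round-trip did not return the original counts, the reported RMSLE on the count scale would be silently wrong, so this is a cheap check worth keeping.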

# poisson model
# prep
rec_cnt <- recipe(count ~ ., data = train_set) %>%
  step_rm(
    datetime,     
    casual, registered,  
    log_count, bc_count, 
    season, day, hour,        
    wday, wday_num
  ) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

prep_cnt <- prep(rec_cnt)

train_baked_cnt <- bake(prep_cnt, new_data = NULL)
test_baked_cnt  <- bake(prep_cnt, new_data = test_set)
# modelling
pois_mod <- glm(
  count ~ .,
  data = train_baked_cnt,
  family = poisson(link = "log")
)

summary(pois_mod)
## 
## Call:
## glm(formula = count ~ ., family = poisson(link = "log"), data = train_baked_cnt)
## 
## Coefficients:
##                           Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)              4.8112046  0.0013249 3631.306   <2e-16 ***
## temp                     0.1179154  0.0050568   23.318   <2e-16 ***
## atemp                    0.1348716  0.0046365   29.089   <2e-16 ***
## humidity                -0.0949470  0.0011546  -82.234   <2e-16 ***
## windspeed               -0.0090818  0.0008822  -10.294   <2e-16 ***
## holiday_Yes             -0.0099195  0.0008754  -11.332   <2e-16 ***
## workingday_Yes           0.0069077  0.0008292    8.331   <2e-16 ***
## weather_Mist            -0.0161882  0.0008890  -18.209   <2e-16 ***
## weather_Light.Snow.Rain -0.1114690  0.0010938 -101.906   <2e-16 ***
## year_X2012               0.2219856  0.0008352  265.802   <2e-16 ***
## month_Feb                0.0375297  0.0015396   24.377   <2e-16 ***
## month_Mar                0.0723670  0.0015302   47.292   <2e-16 ***
## month_Apr                0.1055305  0.0015343   68.783   <2e-16 ***
## month_May                0.1457614  0.0016557   88.035   <2e-16 ***
## month_Jun                0.1218011  0.0018115   67.238   <2e-16 ***
## month_Jul                0.0663951  0.0019722   33.665   <2e-16 ***
## month_Aug                0.0955963  0.0019241   49.683   <2e-16 ***
## month_Sep                0.1401174  0.0017661   79.335   <2e-16 ***
## month_Oct                0.1789193  0.0015671  114.172   <2e-16 ***
## month_Nov                0.1664481  0.0014164  117.517   <2e-16 ***
## month_Dec                0.1694252  0.0014355  118.029   <2e-16 ***
## hour_bin_morning_peak    0.8168870  0.0018047  452.643   <2e-16 ***
## hour_bin_midday          0.7525513  0.0018640  403.723   <2e-16 ***
## hour_bin_afternoon       0.6950053  0.0017044  407.772   <2e-16 ***
## hour_bin_evening_peak    0.8603717  0.0016529  520.513   <2e-16 ***
## hour_bin_night           0.6526048  0.0018674  349.481   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1340258  on 8163  degrees of freedom
## Residual deviance:  382995  on 8138  degrees of freedom
## AIC: 435224
## 
## Number of Fisher Scoring iterations: 5
overdisp <- sum(residuals(pois_mod, type = "pearson")^2) / pois_mod$df.residual
overdisp 
## [1] 49.11274
nb_mod <- glm.nb(
  count ~ .,
  data = train_baked_cnt
)

summary(nb_mod)
## 
## Call:
## glm.nb(formula = count ~ ., data = train_baked_cnt, init.theta = 2.429944107, 
##     link = log)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              4.795439   0.007244 661.986  < 2e-16 ***
## temp                     0.208799   0.048528   4.303 1.69e-05 ***
## atemp                    0.097638   0.043937   2.222 0.026268 *  
## humidity                -0.115533   0.009949 -11.612  < 2e-16 ***
## windspeed               -0.027162   0.008134  -3.339 0.000840 ***
## holiday_Yes             -0.031503   0.007582  -4.155 3.25e-05 ***
## workingday_Yes          -0.114657   0.007477 -15.335  < 2e-16 ***
## weather_Mist            -0.010592   0.007848  -1.350 0.177096    
## weather_Light.Snow.Rain -0.113198   0.008168 -13.858  < 2e-16 ***
## year_X2012               0.224714   0.007368  30.500  < 2e-16 ***
## month_Feb                0.034124   0.010143   3.364 0.000767 ***
## month_Mar                0.047461   0.010765   4.409 1.04e-05 ***
## month_Apr                0.081601   0.011336   7.198 6.10e-13 ***
## month_May                0.141130   0.012957  10.892  < 2e-16 ***
## month_Jun                0.102164   0.014752   6.925 4.35e-12 ***
## month_Jul                0.064226   0.016227   3.958 7.56e-05 ***
## month_Aug                0.073277   0.015932   4.599 4.24e-06 ***
## month_Sep                0.126700   0.014242   8.896  < 2e-16 ***
## month_Oct                0.166584   0.012227  13.624  < 2e-16 ***
## month_Nov                0.167215   0.010493  15.935  < 2e-16 ***
## month_Dec                0.174362   0.010459  16.671  < 2e-16 ***
## hour_bin_morning_peak    0.862512   0.008657  99.628  < 2e-16 ***
## hour_bin_midday          0.753546   0.009529  79.081  < 2e-16 ***
## hour_bin_afternoon       0.696042   0.009747  71.411  < 2e-16 ***
## hour_bin_evening_peak    0.874560   0.009278  94.257  < 2e-16 ***
## hour_bin_night           0.664410   0.008885  74.781  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.4299) family taken to be 1)
## 
##     Null deviance: 26393.6  on 8163  degrees of freedom
## Residual deviance:  8886.1  on 8138  degrees of freedom
## AIC: 92181
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.4299 
##           Std. Err.:  0.0387 
## 
##  2 x log-likelihood:  -92127.4900

Next, a Poisson regression model was fitted on the original count scale. A formal overdispersion check yielded a Pearson dispersion statistic close to 49, indicating that the variance far exceeded the mean and that the Poisson assumption was severely violated. A Negative Binomial (NB) model was therefore fitted instead, with an estimated dispersion parameter of theta ≈ 2.43.
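For reference, the overdispersion statistic computed above is the Pearson chi-square divided by the residual degrees of freedom, and the NB model relaxes the Poisson mean–variance link:

```latex
\hat{\phi} \;=\; \frac{1}{n - p}\sum_{i=1}^{n}\frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i},
\qquad
\text{Poisson: } \operatorname{Var}(Y_i) = \mu_i,
\qquad
\text{NB: } \operatorname{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}.
```

Under a correctly specified Poisson model, \(\hat{\phi}\) should be close to 1; here it is about 49 for the Poisson fit and about 1.03 for the NB fit, so the quadratic NB variance term absorbs essentially all of the extra-Poisson variation.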

plot(nb_mod)

overdisp_nb <- sum(residuals(nb_mod, type="pearson")^2) / nb_mod$df.residual
overdisp_nb
## [1] 1.03498
pred_nb <- predict(nb_mod, newdata = test_baked_cnt, type = "response")
rmsle_nb <- rmsle(pred_nb, test_set$count)
rmsle_nb
## [1] 0.7502783

The NB model captured the overall structure of the data reasonably well, identifying the correct directional effects of weather, seasonality, and time. However, it continued to struggle with extreme high-demand periods, showing a heavy right-tail mismatch and mild heteroscedasticity. Although the NB model serves as a solid statistical baseline, its RMSLE of approximately 0.75 was worse than that of the linear model. This performance gap is likely due to the very high variance in peak demand hours. Even with an extra dispersion parameter, the NB model remains constrained by its parametric form and cannot flexibly adapt to sharp demand spikes.
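One reason peak-hour errors are so costly here: RMSLE measures errors on the log scale, so under-predicting a demand spike by a fixed amount is penalised more heavily than over-predicting by the same amount. A small base-R illustration with toy numbers:

```r
# RMSLE as defined earlier in this report.
rmsle <- function(pred, actual) sqrt(mean((log(pred + 1) - log(actual + 1))^2))

actual <- 400               # a hypothetical peak-hour count
rmsle(200, actual)          # under-predict by 200 -> about 0.69
rmsle(600, actual)          # over-predict by 200  -> about 0.40
```

A model that systematically undershoots commuting peaks, as the NB residual diagnostics suggest this one does, therefore pays a disproportionate RMSLE price.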

rf_rec <- recipe(count ~ temp + atemp + humidity + windspeed +
                   holiday + workingday + weather +
                   year + month + hour,
                 data = train_imp) %>%
  step_dummy(all_nominal_predictors())

prep_rf <- prep(rf_rec)

train_baked_rf <- bake(prep_rf, new_data = NULL)
test_baked_rf  <- bake(prep_rf, new_data = test_set)

# random forest


set.seed(123)

rf_basic <- randomForest(
  count ~ .,
  data      = train_baked_rf,
  ntree     = 500,
  mtry      = 3,
  importance = TRUE
)

rf_basic
## 
## Call:
##  randomForest(formula = count ~ ., data = train_baked_rf, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 8933.908
##                     % Var explained: 72.77
pred_basic <- predict(rf_basic, newdata = test_baked_rf)
rmsle_rf_basic <- rmsle(pred_basic, test_set$count)
rmsle_rf_basic
## [1] 0.9401285


#tuning

K <- 5
fold_id <- sample(rep(1:K, length.out = nrow(train_baked_rf)))

param_grid <- expand.grid(
  mtry     = c(3, 5, 7, 9),
  nodesize = c(5, 15, 30),
  ntree    = c(300, 500, 800)
)

results <- param_grid
results$cv_rmsle <- NA_real_

# NOTE: the full 4 x 3 x 3 grid search below is commented out to keep
# knitting fast; the best configuration from a previous run
# (mtry = 9, nodesize = 5, ntree = 500) is hard-coded in rf_best below.
#for (i in seq_len(nrow(param_grid))) {
#  p <- param_grid[i, ]
#  cv_scores <- numeric(K)
#  
#  for (k in 1:K) {
#    train_fold <- train_baked_rf[fold_id != k, ]
#    valid_fold <- train_baked_rf[fold_id == k, ]
#    
#    fit <- randomForest(
#      count ~ .,
#      data     = train_fold,
#      ntree    = p$ntree,
#      mtry     = p$mtry,
#      nodesize = p$nodesize
#    )
    
#    pred <- predict(fit, newdata = valid_fold)
#    cv_scores[k] <- rmsle(pred, valid_fold$count)
#  }
  
#  results$cv_rmsle[i] <- mean(cv_scores)
#  cat("Done:", i, "of", nrow(param_grid),
#      "| mtry =", p$mtry,
#      "nodesize =", p$nodesize,
#      "ntree =", p$ntree,
#      "| CV RMSLE =", round(results$cv_rmsle[i], 4), "\n")
#}

#results <- results[order(results$cv_rmsle), ]
#head(results, 5) 
best <- results[1, ]
best
##   mtry nodesize ntree cv_rmsle
## 1    3        5   300       NA
rf_best <- randomForest(
  count ~ .,
  data     = train_baked_rf,
  ntree    = 500,#best$ntree,
  mtry     = 9,#best$mtry,
  nodesize = 5,#best$nodesize,
  importance = TRUE
)

pred_test <- predict(rf_best, newdata = test_baked_rf)
rmsle_rf_best <- rmsle(pred_test, test_set$count)
rmsle_rf_best
## [1] 0.4310566

Finally, a Random Forest regression model was applied to capture nonlinear relationships and complex interactions. The untuned baseline model explained approximately 73% of the variance but achieved an RMSLE of only 0.94, indicating poorly suited default hyperparameters. After systematic tuning, the best-performing configuration used mtry = 9, nodesize = 5, and ntree = 500, yielding an RMSLE of 0.43. This substantial improvement confirms that demand is driven by highly nonlinear interactions between hour of day, month, year, and weather conditions, which ensemble tree-based methods capture far better than parametric models.
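As a lighter-weight alternative to the full grid search (commented out above for runtime), randomForest's tuneRF() walks mtry up and down guided by out-of-bag error. A sketch on a built-in dataset, since rerunning it on the project data would just repeat the search; on train_baked_rf the call would use the predictor columns as x and train_baked_rf$count as y:

```r
library(randomForest)

set.seed(123)
# OOB-error-guided search over mtry alone, on the built-in mtcars data.
tuned <- tuneRF(
  x = mtcars[, -1], y = mtcars$mpg,
  mtryStart  = 3,      # initial mtry
  ntreeTry   = 300,    # trees grown per candidate mtry
  stepFactor = 1.5,    # multiply / divide mtry by this at each step
  improve    = 0.01,   # minimum relative OOB improvement to keep searching
  trace = FALSE, plot = FALSE
)
tuned  # candidate mtry values with their OOB errors
```

This only tunes mtry, so nodesize and ntree would still need the grid (or a coarser manual sweep), but it gives a fast first pass before committing to the full cross-validated search.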

varImpPlot(rf_best)

Variable importance was evaluated using the mean decrease in out-of-bag prediction error across trees (%IncMSE), reflecting how much each variable contributes to reducing overall forecasting error.

The most influential predictor by a large margin is hour of day. This confirms that hourly demand is fundamentally structured by daily human activity cycles. In particular, commuting behavior during morning and evening peak hours dominates overall usage patterns. The model relies heavily on this variable to separate low-demand nighttime periods from high-demand commuting periods, making hour the single most critical feature.

The year variable also shows high importance, since demand grew sharply from 2011 to 2012.

Working-day status, month, temperature, humidity, and windspeed are the other important factors.

Overall, the variable importance results reinforce a clear hierarchy. Time-driven behavioral patterns dominate bike-sharing demand, while weather conditions act as modifiers on top of this temporal backbone. This explains why Random Forest, which naturally captures interactions between time and weather, substantially outperforms linear and count-based models.

Conclusion

In this project, I examined hourly bike-sharing demand by combining detailed exploratory analysis with a range of statistical models: linear regression, Poisson, Negative Binomial, and an ensemble model, Random Forest. The analysis confirms that demand is primarily governed by strong temporal structure: hour of day, working-day status, and long-term calendar effects define the fundamental shape of usage. Weather conditions such as temperature, humidity, and windspeed consistently influence demand, but they act mainly as modifiers rather than core drivers.

From a modelling perspective, linear regression with a Box–Cox transformation provides a strong and interpretable baseline, capturing the main directional effects of weather and seasonality. Negative Binomial regression improves upon the Poisson model by addressing overdispersion, but remains limited by its parametric form and struggles with extreme peak-demand periods. In contrast, the tuned Random Forest model substantially outperforms all parametric approaches, achieving the lowest RMSLE by effectively learning nonlinear interactions and sharp demand spikes associated with commuting behavior.

Overall, the results suggest that accurate demand forecasting requires models that can accommodate both strong periodic structure and context-dependent nonlinear effects. For operational use, the tuned Random Forest provides a reliable forecasting tool, while simpler parametric models remain valuable for interpretation and understanding underlying demand drivers.